Home Credit Default Risk
Part 1 | Exploratory Data Analysis | IS 6812
Introduction
The Project
Business Problem
Home Credit cares deeply about the population without sufficient credit history and aims to improve financial inclusion for this underrepresented group. By designing prediction models for loan repayment, Home Credit can avoid denying capable borrowers solely for lacking a credit record while still protecting itself against defaults.
Benefit of a Solution
A more accurate prediction model will help Home Credit a) provide appropriate lending terms and b) balance risk sufficiently, such that financial inclusion is expanded.
Success Metrics
The success of the project will be measured by how accurately it predicts loan defaults, so that an increase in loan approval rates among this underrepresented group is accompanied by stable, or even reduced, default rates.
Analytics Approach
A supervised machine learning classification approach will be used to predict loan default risk, leveraging alternative data sources such as telco and transactional information, with the target variable being loan repayment.
Scope
The project will deliver a predictive model for assessing loan repayment abilities.
Details
The project will be executed by a team of business analysts, equipped with data processing and modeling skills. The project will feature iterative milestones for data exploration, model development, and evaluation, culminating in a final model by calendar year end.
My Approach
For this exploratory data analysis, I plan to use the following languages, APIs, and tools:
- Quarto: probably “the” notebook of choice for the scientific community; built and maintained by Posit, it’s flexible, powerful, and beautiful.
- Python: I usually default to R for analytical tasks so I could use the extra practice in Python.
- DuckDB: an in-process analytical database engine. It is extremely fast and features a convenient syntax. It’s gathered tremendous steam in the industry, even for analysis tasks.
Within the Python ecosystem I’ll use several libraries that will augment the analytical process. These include, but are not limited to:
- pandas: probably the foundational package for data analysis in Python
- scikit-learn: a comprehensive API for data science models and workflows
- statsmodels: one of the best statistics libraries in Python
- plotnine: the {ggplot2} equivalent in Python, bringing “grammar of graphics”
- skimpy: a neat package, inspired by R’s {skimr}, designed to summarize tabular data in a digestible way
Analysis Preparation
Data Source
The data is sourced from Kaggle, where Home Credit hosted a competition back in the summer of 2018. The data was downloaded on August 20th, 2024 and used throughout the Capstone course at the University of Utah, Fall 2024.
Loading Data
We’ll start off importing duckdb. We’ll do much of the data processing work using this powerful SQL engine.
With access to the API, we can begin to query our data located in files. We need only set up “relations” (informing DuckDB where the files are located).
import duckdb

# DuckDB relations
# These are tables against which we can query using the DuckDB API
duckdb.read_csv("data/application_test.csv")
duckdb.read_csv("data/application_train.csv")
duckdb.read_csv("data/bureau_balance.csv")
duckdb.read_csv("data/bureau.csv")
duckdb.read_csv("data/credit_card_balance.csv")
duckdb.read_csv("data/installments_payments.csv")
duckdb.read_csv("data/POS_CASH_balance.csv")
duckdb.read_csv("data/previous_application.csv")

This approach isn’t what you’d call “conventional”. However, DuckDB is uniquely suited to this type of workload. We have multiple files with millions of records each, and DuckDB is more performant than comparable libraries in the Python ecosystem and packages in the R ecosystem.
Okay, so it’s well suited to the problem, but how are we expecting to work with these data? We’ll walk through some use cases quickly to demonstrate how seamless it will actually be.
Working With the Data
Straight away, we can interact with these files with plain SQL. For example, with a simple query (below) we can look at the first 10 rows of the bureau.csv file.
duckdb.sql("SELECT * FROM 'data/bureau.csv' LIMIT 10").show()

┌────────────┬──────────────┬───────────────┬─────────────────┬─────────────┬────────────────────┬─────────────────────┬───────────────────┬────────────────────────┬────────────────────┬────────────────┬─────────────────────┬──────────────────────┬────────────────────────┬─────────────────┬────────────────────┬─────────────┐
│ SK_ID_CURR │ SK_ID_BUREAU │ CREDIT_ACTIVE │ CREDIT_CURRENCY │ DAYS_CREDIT │ CREDIT_DAY_OVERDUE │ DAYS_CREDIT_ENDDATE │ DAYS_ENDDATE_FACT │ AMT_CREDIT_MAX_OVERDUE │ CNT_CREDIT_PROLONG │ AMT_CREDIT_SUM │ AMT_CREDIT_SUM_DEBT │ AMT_CREDIT_SUM_LIMIT │ AMT_CREDIT_SUM_OVERDUE │ CREDIT_TYPE │ DAYS_CREDIT_UPDATE │ AMT_ANNUITY │
│ int64 │ int64 │ varchar │ varchar │ int64 │ int64 │ double │ double │ double │ int64 │ double │ double │ double │ double │ varchar │ int64 │ double │
├────────────┼──────────────┼───────────────┼─────────────────┼─────────────┼────────────────────┼─────────────────────┼───────────────────┼────────────────────────┼────────────────────┼────────────────┼─────────────────────┼──────────────────────┼────────────────────────┼─────────────────┼────────────────────┼─────────────┤
│ 215354 │ 5714462 │ Closed │ currency 1 │ -497 │ 0 │ -153.0 │ -153.0 │ NULL │ 0 │ 91323.0 │ 0.0 │ NULL │ 0.0 │ Consumer credit │ -131 │ NULL │
│ 215354 │ 5714463 │ Active │ currency 1 │ -208 │ 0 │ 1075.0 │ NULL │ NULL │ 0 │ 225000.0 │ 171342.0 │ NULL │ 0.0 │ Credit card │ -20 │ NULL │
│ 215354 │ 5714464 │ Active │ currency 1 │ -203 │ 0 │ 528.0 │ NULL │ NULL │ 0 │ 464323.5 │ NULL │ NULL │ 0.0 │ Consumer credit │ -16 │ NULL │
│ 215354 │ 5714465 │ Active │ currency 1 │ -203 │ 0 │ NULL │ NULL │ NULL │ 0 │ 90000.0 │ NULL │ NULL │ 0.0 │ Credit card │ -16 │ NULL │
│ 215354 │ 5714466 │ Active │ currency 1 │ -629 │ 0 │ 1197.0 │ NULL │ 77674.5 │ 0 │ 2700000.0 │ NULL │ NULL │ 0.0 │ Consumer credit │ -21 │ NULL │
│ 215354 │ 5714467 │ Active │ currency 1 │ -273 │ 0 │ 27460.0 │ NULL │ 0.0 │ 0 │ 180000.0 │ 71017.38 │ 108982.62 │ 0.0 │ Credit card │ -31 │ NULL │
│ 215354 │ 5714468 │ Active │ currency 1 │ -43 │ 0 │ 79.0 │ NULL │ 0.0 │ 0 │ 42103.8 │ 42103.8 │ 0.0 │ 0.0 │ Consumer credit │ -22 │ NULL │
│ 162297 │ 5714469 │ Closed │ currency 1 │ -1896 │ 0 │ -1684.0 │ -1710.0 │ 14985.0 │ 0 │ 76878.45 │ 0.0 │ 0.0 │ 0.0 │ Consumer credit │ -1710 │ NULL │
│ 162297 │ 5714470 │ Closed │ currency 1 │ -1146 │ 0 │ -811.0 │ -840.0 │ 0.0 │ 0 │ 103007.7 │ 0.0 │ 0.0 │ 0.0 │ Consumer credit │ -840 │ NULL │
│ 162297 │ 5714471 │ Active │ currency 1 │ -1146 │ 0 │ -484.0 │ NULL │ 0.0 │ 0 │ 4500.0 │ 0.0 │ 0.0 │ 0.0 │ Credit card │ -690 │ NULL │
├────────────┴──────────────┴───────────────┴─────────────────┴─────────────┴────────────────────┴─────────────────────┴───────────────────┴────────────────────────┴────────────────────┴────────────────┴─────────────────────┴──────────────────────┴────────────────────────┴─────────────────┴────────────────────┴─────────────┤
│ 10 rows 17 columns │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
DuckDB does a nice job styling the output and including helpful details, such as data types.
But don’t think we’re “stuck” in the world of SQL for the duration of this analysis; far from it. DuckDB is very adept at handing its results off to the other packages we’ll be using. Let’s bring in pandas and see how this works:
import pandas as pd

df = duckdb.sql("SELECT * FROM 'data/bureau.csv' LIMIT 10").fetchdf()
type(df)

pandas.core.frame.DataFrame
As you can see, we now have a pandas data frame. We could run a model with scikit-learn, generate a visualization with plotnine, or perform complex custom logic ill-suited to SQL. For now, we’ll just select every other column:
df_sub = df.loc[:, ::2]
df_sub.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 SK_ID_CURR 10 non-null int64
1 CREDIT_ACTIVE 10 non-null object
2 DAYS_CREDIT 10 non-null int64
3 DAYS_CREDIT_ENDDATE 9 non-null float64
4 AMT_CREDIT_MAX_OVERDUE 6 non-null float64
5 AMT_CREDIT_SUM 10 non-null float64
6 AMT_CREDIT_SUM_LIMIT 5 non-null float64
7 CREDIT_TYPE 10 non-null object
8 AMT_ANNUITY 0 non-null float64
dtypes: float64(5), int64(2), object(2)
memory usage: 852.0+ bytes
At this point, we may be ready to leverage the speed and efficiency of DuckDB. So we can just switch right back!
duckdb.sql("SELECT * FROM df_sub").show()

┌────────────┬───────────────┬─────────────┬─────────────────────┬────────────────────────┬────────────────┬──────────────────────┬─────────────────┬─────────────┐
│ SK_ID_CURR │ CREDIT_ACTIVE │ DAYS_CREDIT │ DAYS_CREDIT_ENDDATE │ AMT_CREDIT_MAX_OVERDUE │ AMT_CREDIT_SUM │ AMT_CREDIT_SUM_LIMIT │ CREDIT_TYPE │ AMT_ANNUITY │
│ int64 │ varchar │ int64 │ double │ double │ double │ double │ varchar │ double │
├────────────┼───────────────┼─────────────┼─────────────────────┼────────────────────────┼────────────────┼──────────────────────┼─────────────────┼─────────────┤
│ 215354 │ Closed │ -497 │ -153.0 │ NULL │ 91323.0 │ NULL │ Consumer credit │ NULL │
│ 215354 │ Active │ -208 │ 1075.0 │ NULL │ 225000.0 │ NULL │ Credit card │ NULL │
│ 215354 │ Active │ -203 │ 528.0 │ NULL │ 464323.5 │ NULL │ Consumer credit │ NULL │
│ 215354 │ Active │ -203 │ NULL │ NULL │ 90000.0 │ NULL │ Credit card │ NULL │
│ 215354 │ Active │ -629 │ 1197.0 │ 77674.5 │ 2700000.0 │ NULL │ Consumer credit │ NULL │
│ 215354 │ Active │ -273 │ 27460.0 │ 0.0 │ 180000.0 │ 108982.62 │ Credit card │ NULL │
│ 215354 │ Active │ -43 │ 79.0 │ 0.0 │ 42103.8 │ 0.0 │ Consumer credit │ NULL │
│ 162297 │ Closed │ -1896 │ -1684.0 │ 14985.0 │ 76878.45 │ 0.0 │ Consumer credit │ NULL │
│ 162297 │ Closed │ -1146 │ -811.0 │ 0.0 │ 103007.7 │ 0.0 │ Consumer credit │ NULL │
│ 162297 │ Active │ -1146 │ -484.0 │ 0.0 │ 4500.0 │ 0.0 │ Credit card │ NULL │
├────────────┴───────────────┴─────────────┴─────────────────────┴────────────────────────┴────────────────┴──────────────────────┴─────────────────┴─────────────┤
│ 10 rows 9 columns │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
There really won’t be any issue navigating between the different APIs. In fact, we’ll be able to leverage the unique strengths of each of them to best advance our analysis.
Exploration
Skim each dataset
We’ll start off by getting familiar with each of the data sets. We’ll use the {skimpy} package for this and do some visualizations. We’ll rely on the ERD to help with interpretation:
application_train.csv
Main tables - our train and test samples; target (binary); info about the loan and loan applicant at application time.
This file is the same as application_test.csv except that it features the target variable.
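That claim is easy to verify with a set difference on the column names; here’s the idea on hypothetical miniature stand-ins for the two files:

```python
import pandas as pd

# Hypothetical miniatures of application_train.csv / application_test.csv
train = pd.DataFrame({"SK_ID_CURR": [1], "TARGET": [0], "AMT_CREDIT": [45000.0]})
test = pd.DataFrame({"SK_ID_CURR": [2], "AMT_CREDIT": [270000.0]})

# Columns present in train but absent from test
extra = sorted(set(train.columns) - set(test.columns))
```

Against the real files, `TARGET` should be the only column in the difference.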
from skimpy import skim
app_train_df = (
duckdb # Use DuckDB's query engine
.sql("SELECT * FROM 'data/application_train.csv'") # Read the file
.fetchdf() # Convert to a pandas dataframe
)
skim(app_train_df) # "Skim" the data set

[skimpy summary: 307,511 rows × 122 columns (65 float64, 41 int32, 16 string). Notable figures: TARGET mean ≈ 0.081; dwelling-related fields ~48–70% missing; EXT_SOURCE_1 56% missing; OWN_CAR_AGE 66% missing; OCCUPATION_TYPE 31% missing; AMT_REQ_CREDIT_BUREAU_* fields 13.5% missing.]
This is a LARGE data set. Let’s break down some of the observations we can glean from the summary:
Data Types
- There are several variables cast as numeric that should be categorical
  - Any variables with `FLAG_` prefixes: ~32 columns
    - Examples: `FLAG_MOBIL`, `FLAG_DOCUMENT_#`, …
  - Other variables treated as flags but not specified in the name: ~6 columns
    - Examples: `REG_REGION_NOT_LIVE_REGION`, `LIVE_CITY_NOT_WORK_CITY`, …
  - `REGION_RATING_` variables that are ordinal classifications
- Many variables of type `string` could be candidates for categorical depending on the count of unique values
  - Examples: `NAME_CONTRACT_TYPE`, `ORGANIZATION_TYPE`
- Some variables cast as `string` may need to be numeric
  - Any with the `_MODE` suffix
Missing Data
- `EXT_SOURCE_#`: these are scores ingested from third parties
  - Understandably sparse; not every client would be represented in third-party systems
- Fields including `APARTMENT_`, `BASEMENT_`, `ELEVATOR_`, etc., all relate to the client’s current dwelling
  - 43 of the file’s 122 columns relate to the client’s dwelling
  - These are very sparse, only about ~50% complete, and represent most of the missing data
  - Presumably, the lender was unable to collect such detail for every client
- `OBS_##_SOCIAL_CIRCLE` and `DEF_##_SOCIAL_CIRCLE` relate to default rates in the client’s social circle, observed ## days past due
  - These are counts that are heavily skewed to the right
  - <1% of observations have these fields missing
- `AMT_REQ_CREDIT_BUREAU_XX` fields measure the number of credit inquiries in the XX timeframe
  - ~13% of the data is missing
  - This data set is geared toward users without sufficient credit history, so the missingness makes sense
- `OCCUPATION_TYPE` refers to the type of job the client has
  - Missing for virtually 1 in 3 clients
  - `ORGANIZATION_TYPE`, however, is not missing, so we do retain some attribute related to their work
  - Perhaps in combination with `NAME_EDUCATION_TYPE` there’s predictive power
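Missingness percentages like these are a one-liner in pandas; a sketch on a toy frame with deliberate gaps (in the real analysis this would run on app_train_df):

```python
import pandas as pd

# Toy frame standing in for app_train_df, with deliberate gaps
toy = pd.DataFrame({
    "EXT_SOURCE_1": [0.5, None, None, 0.7],
    "OCCUPATION_TYPE": ["Laborers", None, "Core staff", "Laborers"],
})

# Share of missing values per column, as a percentage
na_pct = toy.isna().mean().mul(100)
```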
Distribution
- All dwelling-related variables have been normalized, per the column descriptions file
  - The histograms in the skim output appear to confirm this
- There are a couple of variables with skewed distributions
  - `DAYS_REGISTRATION`: skewed right
  - `DAYS_LAST_PHONE_CHANGE`: skewed right
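A quick numeric check can back up what the histograms suggest: sample skewness is positive for a long right tail. A sketch on a toy right-skewed series (not the actual column):

```python
import pandas as pd

# Toy series with a long right tail, standing in for a real column
vals = pd.Series([0, 0, 1, 1, 2, 10, 50], name="toy_days")

# pandas computes sample skewness natively; > 0 means right-skewed
skew = vals.skew()
```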
Potential feature engineering
- Dimensionality reduction
  - `FLAG_DOCUMENT_#` fields could potentially be summarized a few ways:
    - % of documents completed
    - Finding principal components
    - Maybe only certain documents matter
  - There are many measures of central tendency for the dwelling fields (avg, mode, median)
    - Likely not all are needed for each perspective
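The “% of documents completed” idea would reduce the document flags to one ratio per applicant; a sketch with two hypothetical flag columns:

```python
import pandas as pd

# Two hypothetical FLAG_DOCUMENT_# columns for three applicants
docs = pd.DataFrame({
    "FLAG_DOCUMENT_2": [0, 1, 0],
    "FLAG_DOCUMENT_3": [1, 1, 0],
})

# Row-wise mean over the flag columns = share of documents provided
doc_share = docs.filter(like="FLAG_DOCUMENT_").mean(axis=1)
```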
bureau.csv
Application data from previous loans that client got from other institutions that were reported to the Credit Bureau. One row per client’s loan in Credit Bureau
This file tells us about the previous loans held by any client with a credit history. Let’s digest the data:
bureau_df = (
duckdb # Use DuckDB's query engine
.sql("SELECT * FROM 'data/bureau.csv'") # Read the file
.fetchdf() # Convert to a pandas dataframe
)
skim(bureau_df) # "Skim" the data set

[skimpy summary: 1,716,428 rows × 17 columns (8 float64, 6 int32, 3 string). Notable missingness: AMT_ANNUITY ~71%, AMT_CREDIT_MAX_OVERDUE ~66%, DAYS_ENDDATE_FACT ~37%, AMT_CREDIT_SUM_LIMIT ~34%, AMT_CREDIT_SUM_DEBT ~15%, DAYS_CREDIT_ENDDATE ~6%.]
The first thing to notice is the ~1.7M records, compared to the previous file of ~360K. Clearly there will be multiple historical bureau records for each matching application.
Data Types
- Everything seems to be in fair order
- There's opportunity to cast the 3 `string` variables as `categorical`
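As a minimal sketch of that cast (a toy frame standing in for the three bureau string columns, not the real `bureau_df`):

```python
import pandas as pd

# Toy frame standing in for the three bureau string columns
df = pd.DataFrame({
    "CREDIT_ACTIVE": ["Active", "Closed", "Active"],
    "CREDIT_CURRENCY": ["currency 1", "currency 1", "currency 2"],
    "CREDIT_TYPE": ["Consumer credit", "Credit card", "Consumer credit"],
})

# Cast each string column to pandas' memory-efficient categorical dtype
for col in ["CREDIT_ACTIVE", "CREDIT_CURRENCY", "CREDIT_TYPE"]:
    df[col] = df[col].astype("category")

print(df.dtypes)
```

With the real ~1.7M-row frame, the categorical dtype also saves a fair amount of memory since each label is stored once.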
Missing Data
- Several columns are missing data
  - `DAYS_CREDIT_ENDDATE` and `DAYS_ENDDATE_FACT` measure days between the current application and the end date of the bureau credit; the end date must simply be missing
  - `AMT_CREDIT_MAX_OVERDUE` is curiously missing a lot of data; perhaps there was never an overdue balance to report
  - `AMT_CREDIT_SUM_DEBT` is understandable if a client had no current debt
  - `AMT_CREDIT_SUM_LIMIT` is understandably blank should a client not have a credit card
  - `AMT_ANNUITY`: only clients with an annuity loan would have values here
Many of those seem reasonable to populate with zeros in a cleaning phase.
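A hedged sketch of that cleaning step, assuming we've decided a missing value really does mean zero for these columns (toy data below, not the real file):

```python
import numpy as np
import pandas as pd

# Hypothetical slice of bureau columns where a missing value plausibly means zero
df = pd.DataFrame({
    "AMT_CREDIT_MAX_OVERDUE": [np.nan, 1500.0, np.nan],
    "AMT_CREDIT_SUM_DEBT": [np.nan, 0.0, 2300.0],
    "AMT_ANNUITY": [np.nan, np.nan, 1200.0],
})

# Fill only the columns where "missing" is believed to mean "none"
zero_fill_cols = ["AMT_CREDIT_MAX_OVERDUE", "AMT_CREDIT_SUM_DEBT", "AMT_ANNUITY"]
df[zero_fill_cols] = df[zero_fill_cols].fillna(0)

print(df.isna().sum().sum())  # no NAs remain in these columns
```

Restricting `fillna` to an explicit column list avoids accidentally zero-filling columns (like the `DAYS_*` dates) where missingness carries different meaning.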
Distribution
`DAYS_CREDIT` seems to be the only significantly skewed variable
Potential feature engineering
- I’d be curious just how much variability in credit reports per individual could be explained by dimensionality reduction (PCA)
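To sketch the idea with plain NumPy (synthetic data standing in for per-client aggregates of the bureau variables; the shapes and the 90% threshold are my own assumptions), PCA via SVD shows how few components might carry most of the variance:

```python
import numpy as np

rng = np.random.default_rng(42)
# Synthetic stand-in: 100 "clients" x 8 aggregated bureau features
X = rng.normal(size=(100, 8))
X[:, 1] = 2 * X[:, 0] + rng.normal(scale=0.1, size=100)  # one correlated pair

# PCA via SVD of the centered matrix
Xc = X - X.mean(axis=0)
_, s, _ = np.linalg.svd(Xc, full_matrices=False)
explained = s**2 / np.sum(s**2)

# Components needed to explain 90% of total variance
k = int(np.searchsorted(np.cumsum(explained), 0.90)) + 1
print(k)
```

Correlated columns (like the engineered pair above) collapse into shared components, so `k` lands below the original feature count.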
bureau_balance.csv
Monthly balance of credits in Credit Bureau
Essentially, we have an expanded version of balances per client loan reported to the bureau.
bureau_bal_df = (
duckdb # Use DuckDB's query engine
.sql("SELECT * FROM 'data/bureau_balance.csv'") # Read the file
.fetchdf() # Convert to a pandas dataframe
)
skim(bureau_bal_df) # "Skim" the data set

[skimpy summary: 27,299,925 rows × 3 columns (int32: 2, string: 1). No missing values; MONTHS_BALANCE ranges from −96 to 0 with a median of −25.]
The first thing I’m seeing is ~27.3M records. Multiple monthly balances per loan.
Data Types
- Everything seems to be in fair order
- There's opportunity to cast the `STATUS` variable as `categorical`
Missing Data
No missing data
Distribution
`MONTHS_BALANCE` seems to be left-skewed: most balances are recent, with a long tail of older months
Potential feature engineering
- It’s possible this dataset can be represented in 2 variables:
- Existence of previous loan: boolean
- Median outstanding balance: numeric
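Since this file carries `STATUS` codes rather than balance amounts, one way to start the collapse might look like the sketch below (toy rows standing in for `bureau_balance.csv`; the aggregate names `months_observed` and `ever_past_due` are my own, not from the competition data):

```python
import pandas as pd

# Toy rows standing in for bureau_balance.csv (one row per loan-month)
bal = pd.DataFrame({
    "SK_ID_BUREAU": [1, 1, 1, 2, 2],
    "MONTHS_BALANCE": [-2, -1, 0, -1, 0],
    "STATUS": ["0", "0", "C", "1", "0"],  # "1"-"5" are days-past-due buckets
})

# Collapse ~27M loan-month rows to one row per bureau loan
per_loan = (
    bal.groupby("SK_ID_BUREAU")
       .agg(months_observed=("MONTHS_BALANCE", "size"),
            ever_past_due=("STATUS", lambda s: bool(s.isin(list("12345")).any())))
       .reset_index()
)
print(per_loan)
```

A second groupby (bureau loan up to `SK_ID_CURR`, via `bureau.csv`) would then get these down to the application grain.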
previous_application.csv
Application data of client’s previous loans in Home Credit. Info about the previous loan parameters and client info at time of previous application. One row per previous application.
This file details previous applications with Home Credit. Some clients may have never applied for loans previously while others could have had multiple applications.
prev_app_df = (
duckdb # Use DuckDB's query engine
.sql("SELECT * FROM 'data/previous_application.csv'") # Read the file
.fetchdf() # Convert to a pandas dataframe
)
skim(prev_app_df) # "Skim" the data set

[skimpy summary: 1,670,214 rows × 37 columns (string: 16, float64: 15, int32: 6). Missingness highlights: RATE_INTEREST_PRIMARY and RATE_INTEREST_PRIVILEGED 99.64%, AMT_DOWN_PAYMENT and RATE_DOWN_PAYMENT 53.64%, NAME_TYPE_SUITE 49.12%, the DAYS_FIRST_DRAWING through DAYS_TERMINATION family plus NFLAG_INSURED_ON_APPROVAL 40.3%, AMT_GOODS_PRICE 23.08%, AMT_ANNUITY and CNT_PAYMENT 22.29%.]
This file has ~1.7M records. Understandably, there are more records than current applications, since some customers have multiple previous loans (one row per previous application).
Data Types
- A few variables are mapped as continuous but should be treated as discrete (possibly categorical): `SELLERPLACE_AREA`, `NFLAG_INSURED_ON_APPROVAL`, `NFLAG_MICRO_CASH`, `NFLAG_LAST_APPL_IN_DAY`, `FLAG_LAST_APPL_PER_CONTRACT`
- A few variables mapped as `string` may be better suited as `categorical`: `NAME_CONTRACT_TYPE`, `NAME_CONTRACT_STATUS`, `NAME_PAYMENT_TYPE`, `NAME_CLIENT_TYPE`, `NAME_GOODS_CATEGORY`, etc.
Missing Data
- There are several missing data points
  - `AMT_ANNUITY` refers to the previous application; if there was none, or its balance was zero, it makes sense for it to be missing
  - `AMT_DOWN_PAYMENT` would be zero if none was made
  - `RATE_DOWN_PAYMENT`: how much was put down relative to the loan (see above)
  - `RATE_INTEREST_PRIMARY` & `RATE_INTEREST_PRIVILEGED` have a very high rate of missing data (99%); since these are applications, there may have been no previous credit
  - `CNT_PAYMENT`: how far into a previous loan term the current application falls
  - `DAYS_FIRST_*` refers to the first disbursement, due amount, etc.; if no loan was approved, it makes sense for these to be blank
  - `NAME_TYPE_SUITE` is the only discrete field with high NAs (49%); it indicates who accompanied the client to the application
Distribution
- Only two notably skewed variables that I can see
  - `DAYS_DECISION` and `RATE_INTEREST_PRIVILEGED` (virtually no data on this one); both are left-skewed, with most of the mass near the high end and a long left tail
Potential feature engineering
- Ultimately this will need to be aggregated; ways to represent the data?
- Count of previous applications
- Status/reject rates
- Classify goods, maybe (% reasonable)
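A rough sketch of the first two aggregation ideas (toy rows standing in for `previous_application.csv`; the feature names `prev_app_cnt` and `refusal_rate` are illustrative):

```python
import pandas as pd

# Toy rows standing in for previous_application.csv
prev = pd.DataFrame({
    "SK_ID_CURR": [100, 100, 100, 200],
    "NAME_CONTRACT_STATUS": ["Approved", "Refused", "Approved", "Approved"],
})

# One row per client: count of previous applications and the share refused
per_client = (
    prev.groupby("SK_ID_CURR")
        .agg(prev_app_cnt=("NAME_CONTRACT_STATUS", "size"),
             refusal_rate=("NAME_CONTRACT_STATUS",
                           lambda s: float((s == "Refused").mean())))
        .reset_index()
)
print(per_client)
```

The resulting frame joins back to `application_train.csv` on `SK_ID_CURR`, one row per current application.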
installments_payments.csv
Past payment data for each installment of previous credits in Home Credit related to loans in our sample.
These data will link to the current application and to past applications. Some clients will have previous loans, even multiple, while others may have none. We can see payments due and made in this file.
installment_pmt_df = (
duckdb # Use DuckDB's query engine
.sql("SELECT * FROM 'data/installments_payments.csv'") # Read the file
.fetchdf() # Convert to a pandas dataframe
)
skim(installment_pmt_df) # "Skim" the data set

[skimpy summary: 13,605,401 rows × 8 columns (float64: 5, int32: 3). DAYS_ENTRY_PAYMENT and AMT_PAYMENT are each missing 2,905 values (0.02%); all other columns are complete.]
This file has ~13.6M records.
Data Types
- We’ve got two ID columns that aren’t really continuous: `SK_ID_PREV` and `SK_ID_CURR`
- All other variable data types look fine
Missing Data
`DAYS_ENTRY_PAYMENT` and `AMT_PAYMENT` are mostly complete. The few missing values are presumably unpaid amounts, maybe due to default but perhaps due to the cutoff date
Distribution
`DAYS_INSTALMENT` seems to be left-skewed: most installments fall close to the application date, with a long tail of older ones
Potential feature engineering
- We could probably synthesize this down to the installment payment timing
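One hedged sketch of that synthesis (toy rows standing in for `installments_payments.csv`; the derived `days_late` column is my own naming):

```python
import pandas as pd

# Toy rows standing in for installments_payments.csv
inst = pd.DataFrame({
    "SK_ID_CURR": [100, 100, 200],
    "DAYS_INSTALMENT": [-30.0, -60.0, -15.0],
    "DAYS_ENTRY_PAYMENT": [-28.0, -65.0, -15.0],
})

# Positive = paid after the due date, negative = paid early
inst["days_late"] = inst["DAYS_ENTRY_PAYMENT"] - inst["DAYS_INSTALMENT"]

# Average payment timing per client
timing = inst.groupby("SK_ID_CURR")["days_late"].mean().reset_index()
print(timing)
```

Both `DAYS_*` columns count backwards from the current application, so their difference is directly interpretable as days late (or early) on each installment.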
POS_CASH_balance.csv
Monthly balance of client’s previous loans in Home Credit
These data will link to the application and to past payment installments. Some clients will have previous loans, even multiple, while others may have none.
cash_bal_df = (
duckdb # Use DuckDB's query engine
.sql("SELECT * FROM 'data/POS_CASH_balance.csv'") # Read the file
.fetchdf() # Convert to a pandas dataframe
)
skim(cash_bal_df) # "Skim" the data set

[skimpy summary: 10,001,358 rows × 8 columns (int32: 5, float64: 2, string: 1). CNT_INSTALMENT and CNT_INSTALMENT_FUTURE are each ~0.26% missing; all other columns are complete.]
This file has ~10.0M records. Understandably, there are more records than applications due to monthly summaries and some customers having multiple previous loans.
Data Types
- We’ve got two ID columns that aren’t really continuous: `SK_ID_PREV` and `SK_ID_CURR`
- There's opportunity to cast the `NAME_CONTRACT_STATUS` variable as `categorical`
Missing Data
`CNT_INSTALMENT` and `CNT_INSTALMENT_FUTURE` are mostly complete. Where values are missing, presumably there's nothing outstanding on previous loans.
Distribution
`MONTHS_BALANCE` seems to be left-skewed: most balances are recent, with a long tail of older months
Potential feature engineering
- It’s possible this dataset can be represented in 2 variables:
- Existence of previous loan: boolean
- Median outstanding balance: numeric
credit_card_balance.csv
Monthly balance of client’s previous credit card loans in Home Credit
These data are all about credit card loans with Home Credit, specifically the balance that is carried.
credit_card_df = (
duckdb # Use DuckDB's query engine
.sql("SELECT * FROM 'data/credit_card_balance.csv'") # Read the file
.fetchdf() # Convert to a pandas dataframe
)
skim(credit_card_df) # "Skim" the data set

[skimpy summary: 3,840,312 rows × 23 columns (float64: 15, int32: 7, string: 1). Missingness highlights: the ATM/OTHER/POS drawings amount and count columns 19.52%, AMT_PAYMENT_CURRENT 20%, AMT_INST_MIN_REGULARITY and CNT_INSTALMENT_MATURE_CUM 7.95%.]
This file has ~3.8M records.
Data Types
- We’ve got two ID columns that aren’t really continuous: `SK_ID_PREV` and `SK_ID_CURR`
- There's opportunity to cast the `NAME_CONTRACT_STATUS` variable as `categorical`
Missing Data
- There are several “count” variables with missing data (~20% missing)
  - `CNT_INSTALMENT_MATURE_CUM`, `CNT_DRAWINGS_POS_CURRENT`, `CNT_DRAWINGS_OTHER_CURRENT`, `CNT_DRAWINGS_ATM_CURRENT`
  - I assume much of this is missing because there were no withdrawals in those categories
- Some “amount” variables corresponding to the same categories are missing similar amounts of data
Distribution
- `CNT_INSTALMENT_MATURE_CUM` is right-skewed, with a long tail of high counts (understandable for a cumulative measure)
- `MONTHS_BALANCE` is left-skewed, with most balances recent
Potential feature engineering
- It’s possible this dataset can be represented in 2 variables:
- Withdrawal amounts relative to limit
- Withdrawal counts
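A hedged sketch of the utilization side of that idea (toy rows standing in for `credit_card_balance.csv`; the `utilization` feature name is my own):

```python
import numpy as np
import pandas as pd

# Toy rows standing in for credit_card_balance.csv
cc = pd.DataFrame({
    "SK_ID_CURR": [100, 100, 200],
    "AMT_BALANCE": [45000.0, 90000.0, 0.0],
    "AMT_CREDIT_LIMIT_ACTUAL": [90000.0, 90000.0, 135000.0],
})

# Utilization = balance relative to limit; guard against zero limits
cc["utilization"] = cc["AMT_BALANCE"] / cc["AMT_CREDIT_LIMIT_ACTUAL"].replace(0, np.nan)

# Average monthly utilization per client
util = cc.groupby("SK_ID_CURR")["utilization"].mean().reset_index()
print(util)
```

Replacing zero limits with NaN before dividing keeps closed or inactive cards from producing infinite ratios.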
Dataset Joins
With a fair understanding of the files we’re working with, let’s explore the central idea of the project: Home Credit wants to improve inclusivity for an underserved population. This group is generally without sufficient (or any) credit history. Let’s see if we can identify those applications by joining the datasets.
No Credit Bureau History
For this look, we need to join application_train.csv and bureau.csv. We’ll then aggregate to find out the distribution of credit history among Home Credit clients.
bureau_history = (
duckdb.sql("""
WITH
STEP1 AS (
SELECT
at.SK_ID_CURR AS app_id
,COUNT(b.SK_ID_CURR) AS bureau_history_cnt
FROM 'data/application_train.csv' at
LEFT JOIN 'data/bureau.csv' b ON b.SK_ID_CURR = at.SK_ID_CURR
GROUP BY ALL
)
SELECT
bureau_history_cnt
,COUNT(*) AS app_cnt
FROM STEP1
GROUP BY ALL
ORDER BY bureau_history_cnt
""").fetchdf()
)
bureau_history.head()

|   | bureau_history_cnt | app_cnt |
|---|---|---|
| 0 | 0 | 44020 |
| 1 | 1 | 36072 |
| 2 | 2 | 35635 |
| 3 | 3 | 32925 |
| 4 | 4 | 28973 |
I mentioned previously that DuckDB has tremendous syntax and we get our first glimpse in the above code block. Instead of needing to mirror the non-aggregated column names from the SELECT clause down to the GROUP BY clause, using GROUP BY ALL will perform this step for us (documentation). Glorious!
What we’ve done here is isolate the number of applications for which there is no history with the credit bureau (44,020 applications). Additionally, there are those with little credit history. How much is little? Let’s compute a frequency-weighted average to get a truer sense of the distribution (we know it’s right-skewed):
(bureau_history # Take the history aggregated
.assign(
app_freq = lambda df: df['app_cnt'] / sum(df['app_cnt']), # Calculate the percent frequency
adj_history_cnt = lambda df: df['app_freq'] * df['bureau_history_cnt'] # Calculate adjusted count of history
)
['adj_history_cnt'] # Select the new variable
.sum() # Get the total (our new average)
)

4.765114093479584
Here we learn that the average number of historical credit bureau records is ~4.8 (<= 5). Since this is the average of a right-skewed distribution, an application with insufficient history is far more likely to have fewer than 3 (< 3) records. That, we’d probably say, is our working definition of insufficient credit history.
Supplemental History
Now that we’ve identified the underserved population this project cares about, let’s see if we can confirm there exist other data points that could help assess creditworthiness.
In theory, Home Credit has previous application data on these individuals that may contribute. Let’s expand on the query above to see:
supplement_history = (
duckdb.sql("""
WITH
STEP1 AS (
SELECT
at.SK_ID_CURR
,COUNT(b.SK_ID_CURR) AS bureau_history_cnt
FROM 'data/application_train.csv' at
LEFT JOIN 'data/bureau.csv' b ON b.SK_ID_CURR = at.SK_ID_CURR
GROUP BY ALL
)
,STEP2 AS (
SELECT
s.SK_ID_CURR AS ap_id
,COUNT(pa.SK_ID_PREV) AS prev_app_history_cnt
FROM STEP1 s
LEFT JOIN 'data/previous_application.csv' pa ON pa.SK_ID_CURR = s.SK_ID_CURR
WHERE 1 = 1
AND bureau_history_cnt <= 5 -- Insufficient credit history
GROUP BY ALL
)
SELECT
prev_app_history_cnt
,COUNT(*) AS app_cnt
FROM STEP2
GROUP BY ALL
ORDER BY prev_app_history_cnt
""").fetchdf()
)
supplement_history.head()

|   | prev_app_history_cnt | app_cnt |
|---|---|---|
| 0 | 0 | 12547 |
| 1 | 1 | 39345 |
| 2 | 2 | 32357 |
| 3 | 3 | 27080 |
| 4 | 4 | 21688 |
Despite most of this underserved population having some previous application data with Home Credit that we could use to supplement insufficient credit bureau history, there are 12,547 applications without any previous history with Home Credit.
(supplement_history # Take the history aggregated
.assign(
app_freq = lambda df: df['app_cnt'] / sum(df['app_cnt']), # Calculate the percent frequency
adj_history_cnt = lambda df: df['app_freq'] * df['prev_app_history_cnt'] # Calculate adjusted count of history
)
['adj_history_cnt'] # Select the new variable
.sum() # Get the total (our new average)
)

4.136952766398499
Again, we see the average number of previous applications with Home Credit to be <= 5. We’ll assume the same rule from above (< 3) for the “insufficient” group.
Insufficient History Options
We’ve established a group that has insufficient history from both the credit bureau and historical Home Credit records. The next step is to re-analyze the application_train.csv file for this group only. Any model we design has to be inclusive of this group and leverage what data exists therein.
Here, we’re restricting the data to our “insufficient” group and we’ll skim again:
insufficient_history = (
duckdb.sql("""
WITH
STEP1 AS (
SELECT
at.SK_ID_CURR
,COUNT(DISTINCT b.SK_ID_BUREAU) AS bureau_history_cnt
,COUNT(DISTINCT pa.SK_ID_PREV) AS hc_history_cnt
FROM 'data/application_train.csv' at
LEFT JOIN 'data/bureau.csv' b ON b.SK_ID_CURR = at.SK_ID_CURR
LEFT JOIN 'data/previous_application.csv' pa ON pa.SK_ID_CURR = at.SK_ID_CURR
GROUP BY
at.SK_ID_CURR
)
SELECT
at.*
FROM 'data/application_train.csv' at
INNER JOIN STEP1 ON STEP1.SK_ID_CURR = at.SK_ID_CURR
WHERE bureau_history_cnt < 3
AND hc_history_cnt < 3
""").fetchdf()
)
skim(insufficient_history)

[skimpy summary: 51,288 rows × 122 columns (float64: 65, int32: 41, string: 16). TARGET mean ≈ 0.0898; OWN_CAR_AGE is 69.47% missing.]
│ │ │ │ │ │ │ │ │ │ │ │ │ │ LIVE_REG │ 0 │ 0 │ 0.04816 │ 0.2141 │ 0 │ 0 │ 0 │ 0 │ 1 │ ▇ │ │ │ │ ION_NOT_ │ │ │ │ │ │ │ │ │ │ │ │ │ │ WORK_REG │ │ │ │ │ │ │ │ │ │ │ │ │ │ ION │ │ │ │ │ │ │ │ │ │ │ │ │ │ REG_CITY │ 0 │ 0 │ 0.09774 │ 0.297 │ 0 │ 0 │ 0 │ 0 │ 1 │ ▇ ▁ │ │ │ │ _NOT_LIV │ │ │ │ │ │ │ │ │ │ │ │ │ │ E_CITY │ │ │ │ │ │ │ │ │ │ │ │ │ │ REG_CITY │ 0 │ 0 │ 0.2582 │ 0.4377 │ 0 │ 0 │ 0 │ 1 │ 1 │ ▇ ▃ │ │ │ │ _NOT_WOR │ │ │ │ │ │ │ │ │ │ │ │ │ │ K_CITY │ │ │ │ │ │ │ │ │ │ │ │ │ │ LIVE_CIT │ 0 │ 0 │ 0.1921 │ 0.3939 │ 0 │ 0 │ 0 │ 0 │ 1 │ ▇ ▂ │ │ │ │ Y_NOT_WO │ │ │ │ │ │ │ │ │ │ │ │ │ │ RK_CITY │ │ │ │ │ │ │ │ │ │ │ │ │ │ EXT_SOUR │ 30789 │ 60.03 │ 0.4715 │ 0.2159 │ 0.0195 │ 0.2937 │ 0.4636 │ 0.6457 │ 0.9461 │ ▃▇▇▇▆▃ │ │ │ │ CE_1 │ │ │ │ │ │ │ │ │ │ │ │ │ │ EXT_SOUR │ 206 │ 0.4 │ 0.5056 │ 0.1964 │ 1.316e-0 │ 0.3703 │ 0.5563 │ 0.6623 │ 0.855 │ ▁▃▃▅▇▃ │ │ │ │ CE_2 │ │ │ │ │ 6 │ │ │ │ │ │ │ │ │ EXT_SOUR │ 24859 │ 48.47 │ 0.549 │ 0.1928 │ 0.000527 │ 0.4084 │ 0.5815 │ 0.7047 │ 0.894 │ ▁▂▅▆▇▅ │ │ │ │ CE_3 │ │ │ │ │ 3 │ │ │ │ │ │ │ │ │ APARTMEN │ 27024 │ 52.69 │ 0.1207 │ 0.1125 │ 0 │ 0.0577 │ 0.0897 │ 0.1485 │ 1 │ ▇▂ │ │ │ │ TS_AVG │ │ │ │ │ │ │ │ │ │ │ │ │ │ BASEMENT │ 30871 │ 60.19 │ 0.08953 │ 0.08507 │ 0 │ 0.0442 │ 0.0765 │ 0.113 │ 1 │ ▇▁ │ │ │ │ AREA_AVG │ │ │ │ │ │ │ │ │ │ │ │ │ │ YEARS_BE │ 25972 │ 50.64 │ 0.9777 │ 0.05906 │ 0 │ 0.9767 │ 0.9819 │ 0.9871 │ 1 │ ▇ │ │ │ │ GINEXPLU │ │ │ │ │ │ │ │ │ │ │ │ │ │ ATATION_ │ │ │ │ │ │ │ │ │ │ │ │ │ │ AVG │ │ │ │ │ │ │ │ │ │ │ │ │ │ YEARS_BU │ 34819 │ 67.89 │ 0.7536 │ 0.1137 │ 0 │ 0.6872 │ 0.7552 │ 0.8232 │ 1 │ ▂▇▃ │ │ │ │ ILD_AVG │ │ │ │ │ │ │ │ │ │ │ │ │ │ COMMONAR │ 36530 │ 71.23 │ 0.04638 │ 0.08186 │ 0 │ 0.0077 │ 0.0211 │ 0.0527 │ 1 │ ▇ │ │ │ │ EA_AVG │ │ │ │ │ │ │ │ │ │ │ │ │ │ ELEVATOR │ 28188 │ 54.96 │ 0.08437 │ 0.1405 │ 0 │ 0 │ 0 │ 0.14 │ 1 │ ▇▁ │ │ │ │ S_AVG │ │ │ │ │ │ │ │ │ │ │ │ │ │ ENTRANCE │ 26759 │ 52.17 │ 0.1494 │ 0.1016 │ 0 │ 0.069 │ 0.1379 │ 0.2069 │ 1 │ ▇▃ │ │ │ │ S_AVG │ │ │ │ │ │ │ │ │ │ │ │ │ │ FLOORSMA │ 
26476 │ 51.62 │ 0.2335 │ 0.1534 │ 0 │ 0.1667 │ 0.1667 │ 0.3333 │ 1 │ ▃▇▁ │ │ │ │ X_AVG │ │ │ │ │ │ │ │ │ │ │ │ │ │ FLOORSMI │ 35494 │ 69.21 │ 0.2386 │ 0.1691 │ 0 │ 0.0833 │ 0.2083 │ 0.375 │ 1 │ ▆▇▅▁▁ │ │ │ │ N_AVG │ │ │ │ │ │ │ │ │ │ │ │ │ │ LANDAREA │ 31305 │ 61.04 │ 0.06611 │ 0.08361 │ 0 │ 0.0173 │ 0.0473 │ 0.0854 │ 1 │ ▇ │ │ │ │ _AVG │ │ │ │ │ │ │ │ │ │ │ │ │ │ LIVINGAP │ 35741 │ 69.69 │ 0.1044 │ 0.09666 │ 0 │ 0.0504 │ 0.0756 │ 0.1275 │ 1 │ ▇▁ │ │ │ │ ARTMENTS │ │ │ │ │ │ │ │ │ │ │ │ │ │ _AVG │ │ │ │ │ │ │ │ │ │ │ │ │ │ LIVINGAR │ 26694 │ 52.05 │ 0.1109 │ 0.1151 │ 0 │ 0.0453 │ 0.0756 │ 0.1385 │ 1 │ ▇▁ │ │ │ │ EA_AVG │ │ │ │ │ │ │ │ │ │ │ │ │ │ NONLIVIN │ 36271 │ 70.72 │ 0.009464 │ 0.05139 │ 0 │ 0 │ 0 │ 0.0053 │ 1 │ ▇ │ │ │ │ GAPARTME │ │ │ │ │ │ │ │ │ │ │ │ │ │ NTS_AVG │ │ │ │ │ │ │ │ │ │ │ │ │ │ NONLIVIN │ 29200 │ 56.93 │ 0.03025 │ 0.07467 │ 0 │ 0 │ 0.004 │ 0.0301 │ 1 │ ▇ │ │ │ │ GAREA_AV │ │ │ │ │ │ │ │ │ │ │ │ │ │ G │ │ │ │ │ │ │ │ │ │ │ │ │ │ APARTMEN │ 27024 │ 52.69 │ 0.1171 │ 0.1119 │ 0 │ 0.0525 │ 0.084 │ 0.1492 │ 1 │ ▇▂ │ │ │ │ TS_MODE │ │ │ │ │ │ │ │ │ │ │ │ │ │ BASEMENT │ 30871 │ 60.19 │ 0.0883 │ 0.08704 │ 0 │ 0.04 │ 0.0747 │ 0.1132 │ 1 │ ▇▁ │ │ │ │ AREA_MOD │ │ │ │ │ │ │ │ │ │ │ │ │ │ E │ │ │ │ │ │ │ │ │ │ │ │ │ │ YEARS_BE │ 25972 │ 50.64 │ 0.9771 │ 0.06404 │ 0 │ 0.9767 │ 0.9816 │ 0.9866 │ 1 │ ▇ │ │ │ │ GINEXPLU │ │ │ │ │ │ │ │ │ │ │ │ │ │ ATATION_ │ │ │ │ │ │ │ │ │ │ │ │ │ │ MODE │ │ │ │ │ │ │ │ │ │ │ │ │ │ YEARS_BU │ 34819 │ 67.89 │ 0.7607 │ 0.1104 │ 0 │ 0.6994 │ 0.7648 │ 0.8301 │ 1 │ ▂▇▃ │ │ │ │ ILD_MODE │ │ │ │ │ │ │ │ │ │ │ │ │ │ COMMONAR │ 36530 │ 71.23 │ 0.04395 │ 0.08004 │ 0 │ 0.0069 │ 0.0191 │ 0.0497 │ 1 │ ▇ │ │ │ │ EA_MODE │ │ │ │ │ │ │ │ │ │ │ │ │ │ ELEVATOR │ 28188 │ 54.96 │ 0.07932 │ 0.1374 │ 0 │ 0 │ 0 │ 0.1208 │ 1 │ ▇▁ │ │ │ │ S_MODE │ │ │ │ │ │ │ │ │ │ │ │ │ │ ENTRANCE │ 26759 │ 52.17 │ 0.1445 │ 0.1025 │ 0 │ 0.069 │ 0.1379 │ 0.2069 │ 1 │ ▇▃ │ │ │ │ S_MODE │ │ │ │ │ │ │ │ │ │ │ │ │ │ FLOORSMA │ 26476 │ 51.62 │ 0.2289 │ 0.1518 │ 0 │ 0.1667 
│ 0.1667 │ 0.3333 │ 1 │ ▃▇▁ │ │ │ │ X_MODE │ │ │ │ │ │ │ │ │ │ │ │ │ │ FLOORSMI │ 35494 │ 69.21 │ 0.2341 │ 0.1689 │ 0 │ 0.0833 │ 0.2083 │ 0.375 │ 1 │ ▆▇▅▁▁ │ │ │ │ N_MODE │ │ │ │ │ │ │ │ │ │ │ │ │ │ LANDAREA │ 31305 │ 61.04 │ 0.06452 │ 0.08361 │ 0 │ 0.01515 │ 0.0447 │ 0.0834 │ 1 │ ▇ │ │ │ │ _MODE │ │ │ │ │ │ │ │ │ │ │ │ │ │ LIVINGAP │ 35741 │ 69.69 │ 0.1093 │ 0.1023 │ 0 │ 0.0542 │ 0.079 │ 0.1322 │ 1 │ ▇▁ │ │ │ │ ARTMENTS │ │ │ │ │ │ │ │ │ │ │ │ │ │ _MODE │ │ │ │ │ │ │ │ │ │ │ │ │ │ LIVINGAR │ 26694 │ 52.05 │ 0.1092 │ 0.1163 │ 0 │ 0.04262 │ 0.0735 │ 0.131 │ 1 │ ▇▁ │ │ │ │ EA_MODE │ │ │ │ │ │ │ │ │ │ │ │ │ │ NONLIVIN │ 36271 │ 70.72 │ 0.00863 │ 0.04937 │ 0 │ 0 │ 0 │ 0.0039 │ 1 │ ▇ │ │ │ │ GAPARTME │ │ │ │ │ │ │ │ │ │ │ │ │ │ NTS_MODE │ │ │ │ │ │ │ │ │ │ │ │ │ │ NONLIVIN │ 29200 │ 56.93 │ 0.02863 │ 0.07492 │ 0 │ 0 │ 0.0013 │ 0.0247 │ 1 │ ▇ │ │ │ │ GAREA_MO │ │ │ │ │ │ │ │ │ │ │ │ │ │ DE │ │ │ │ │ │ │ │ │ │ │ │ │ │ APARTMEN │ 27024 │ 52.69 │ 0.1211 │ 0.1133 │ 0 │ 0.0583 │ 0.08875 │ 0.1499 │ 1 │ ▇▂ │ │ │ │ TS_MEDI │ │ │ │ │ │ │ │ │ │ │ │ │ │ BASEMENT │ 30871 │ 60.19 │ 0.08905 │ 0.08487 │ 0 │ 0.0438 │ 0.0761 │ 0.1126 │ 1 │ ▇▁ │ │ │ │ AREA_MED │ │ │ │ │ │ │ │ │ │ │ │ │ │ I │ │ │ │ │ │ │ │ │ │ │ │ │ │ YEARS_BE │ 25972 │ 50.64 │ 0.9777 │ 0.05961 │ 0 │ 0.9767 │ 0.9816 │ 0.9871 │ 1 │ ▇ │ │ │ │ GINEXPLU │ │ │ │ │ │ │ │ │ │ │ │ │ │ ATATION_ │ │ │ │ │ │ │ │ │ │ │ │ │ │ MEDI │ │ │ │ │ │ │ │ │ │ │ │ │ │ YEARS_BU │ 34819 │ 67.89 │ 0.7569 │ 0.1125 │ 0 │ 0.6914 │ 0.7585 │ 0.8256 │ 1 │ ▂▇▃ │ │ │ │ ILD_MEDI │ │ │ │ │ │ │ │ │ │ │ │ │ │ COMMONAR │ 36530 │ 71.23 │ 0.0463 │ 0.08155 │ 0 │ 0.0076 │ 0.0209 │ 0.05267 │ 1 │ ▇ │ │ │ │ EA_MEDI │ │ │ │ │ │ │ │ │ │ │ │ │ │ ELEVATOR │ 28188 │ 54.96 │ 0.08353 │ 0.1405 │ 0 │ 0 │ 0 │ 0.14 │ 1 │ ▇▁ │ │ │ │ S_MEDI │ │ │ │ │ │ │ │ │ │ │ │ │ │ ENTRANCE │ 26759 │ 52.17 │ 0.149 │ 0.1019 │ 0 │ 0.069 │ 0.1379 │ 0.2069 │ 1 │ ▇▃ │ │ │ │ S_MEDI │ │ │ │ │ │ │ │ │ │ │ │ │ │ FLOORSMA │ 26476 │ 51.62 │ 0.2332 │ 0.1537 │ 0 │ 0.1667 │ 0.1667 │ 0.3333 │ 1 │ ▃▇▁ │ │ │ │ 
X_MEDI │ │ │ │ │ │ │ │ │ │ │ │ │ │ FLOORSMI │ 35494 │ 69.21 │ 0.2383 │ 0.1698 │ 0 │ 0.0833 │ 0.2083 │ 0.375 │ 1 │ ▆▇▅▁▁ │ │ │ │ N_MEDI │ │ │ │ │ │ │ │ │ │ │ │ │ │ LANDAREA │ 31305 │ 61.04 │ 0.06697 │ 0.08485 │ 0 │ 0.0173 │ 0.0477 │ 0.0866 │ 1 │ ▇▁ │ │ │ │ _MEDI │ │ │ │ │ │ │ │ │ │ │ │ │ │ LIVINGAP │ 35741 │ 69.69 │ 0.1056 │ 0.09756 │ 0 │ 0.0513 │ 0.077 │ 0.1274 │ 1 │ ▇▁ │ │ │ │ ARTMENTS │ │ │ │ │ │ │ │ │ │ │ │ │ │ _MEDI │ │ │ │ │ │ │ │ │ │ │ │ │ │ LIVINGAR │ 26694 │ 52.05 │ 0.1122 │ 0.117 │ 0 │ 0.0458 │ 0.0763 │ 0.1389 │ 1 │ ▇▁ │ │ │ │ EA_MEDI │ │ │ │ │ │ │ │ │ │ │ │ │ │ NONLIVIN │ 36271 │ 70.72 │ 0.009239 │ 0.05063 │ 0 │ 0 │ 0 │ 0.0039 │ 1 │ ▇ │ │ │ │ GAPARTME │ │ │ │ │ │ │ │ │ │ │ │ │ │ NTS_MEDI │ │ │ │ │ │ │ │ │ │ │ │ │ │ NONLIVIN │ 29200 │ 56.93 │ 0.03012 │ 0.07527 │ 0 │ 0 │ 0.0035 │ 0.029 │ 1 │ ▇ │ │ │ │ GAREA_ME │ │ │ │ │ │ │ │ │ │ │ │ │ │ DI │ │ │ │ │ │ │ │ │ │ │ │ │ │ TOTALARE │ 25713 │ 50.13 │ 0.1057 │ 0.1124 │ 0 │ 0.0412 │ 0.0695 │ 0.1361 │ 1 │ ▇▂ │ │ │ │ A_MODE │ │ │ │ │ │ │ │ │ │ │ │ │ │ OBS_30_C │ 331 │ 0.65 │ 1.276 │ 2.167 │ 0 │ 0 │ 0 │ 2 │ 27 │ ▇▁ │ │ │ │ NT_SOCIA │ │ │ │ │ │ │ │ │ │ │ │ │ │ L_CIRCLE │ │ │ │ │ │ │ │ │ │ │ │ │ │ DEF_30_C │ 331 │ 0.65 │ 0.1326 │ 0.4237 │ 0 │ 0 │ 0 │ 0 │ 6 │ ▇▁ │ │ │ │ NT_SOCIA │ │ │ │ │ │ │ │ │ │ │ │ │ │ L_CIRCLE │ │ │ │ │ │ │ │ │ │ │ │ │ │ OBS_60_C │ 331 │ 0.65 │ 1.26 │ 2.147 │ 0 │ 0 │ 0 │ 2 │ 27 │ ▇▁ │ │ │ │ NT_SOCIA │ │ │ │ │ │ │ │ │ │ │ │ │ │ L_CIRCLE │ │ │ │ │ │ │ │ │ │ │ │ │ │ DEF_60_C │ 331 │ 0.65 │ 0.09169 │ 0.3424 │ 0 │ 0 │ 0 │ 0 │ 5 │ ▇▁ │ │ │ │ NT_SOCIA │ │ │ │ │ │ │ │ │ │ │ │ │ │ L_CIRCLE │ │ │ │ │ │ │ │ │ │ │ │ │ │ DAYS_LAS │ 1 │ 0 │ -660.1 │ 750 │ -4292 │ -876 │ -436 │ -47 │ 0 │ ▁▁▂▇ │ │ │ │ T_PHONE_ │ │ │ │ │ │ │ │ │ │ │ │ │ │ CHANGE │ │ │ │ │ │ │ │ │ │ │ │ │ │ FLAG_DOC │ 0 │ 0 │ 1.95e-05 │ 0.004416 │ 0 │ 0 │ 0 │ 0 │ 1 │ ▇ │ │ │ │ UMENT_2 │ │ │ │ │ │ │ │ │ │ │ │ │ │ FLAG_DOC │ 0 │ 0 │ 0.6386 │ 0.4804 │ 0 │ 0 │ 1 │ 1 │ 1 │ ▅ ▇ │ │ │ │ UMENT_3 │ │ │ │ │ │ │ │ │ │ │ │ │ │ FLAG_DOC │ 0 │ 0 │ 5.849e-0 │ 
0.007648 │ 0 │ 0 │ 0 │ 0 │ 1 │ ▇ │ │ │ │ UMENT_4 │ │ │ 5 │ │ │ │ │ │ │ │ │ │ │ FLAG_DOC │ 0 │ 0 │ 0.01538 │ 0.1231 │ 0 │ 0 │ 0 │ 0 │ 1 │ ▇ │ │ │ │ UMENT_5 │ │ │ │ │ │ │ │ │ │ │ │ │ │ FLAG_DOC │ 0 │ 0 │ 0.09546 │ 0.2939 │ 0 │ 0 │ 0 │ 0 │ 1 │ ▇ ▁ │ │ │ │ UMENT_6 │ │ │ │ │ │ │ │ │ │ │ │ │ │ FLAG_DOC │ 0 │ 0 │ 0.000370 │ 0.01924 │ 0 │ 0 │ 0 │ 0 │ 1 │ ▇ │ │ │ │ UMENT_7 │ │ │ 5 │ │ │ │ │ │ │ │ │ │ │ FLAG_DOC │ 0 │ 0 │ 0.08343 │ 0.2765 │ 0 │ 0 │ 0 │ 0 │ 1 │ ▇ ▁ │ │ │ │ UMENT_8 │ │ │ │ │ │ │ │ │ │ │ │ │ │ FLAG_DOC │ 0 │ 0 │ 0.004835 │ 0.06937 │ 0 │ 0 │ 0 │ 0 │ 1 │ ▇ │ │ │ │ UMENT_9 │ │ │ │ │ │ │ │ │ │ │ │ │ │ FLAG_DOC │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ ▇ │ │ │ │ UMENT_10 │ │ │ │ │ │ │ │ │ │ │ │ │ │ FLAG_DOC │ 0 │ 0 │ 0.01037 │ 0.1013 │ 0 │ 0 │ 0 │ 0 │ 1 │ ▇ │ │ │ │ UMENT_11 │ │ │ │ │ │ │ │ │ │ │ │ │ │ FLAG_DOC │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ ▇ │ │ │ │ UMENT_12 │ │ │ │ │ │ │ │ │ │ │ │ │ │ FLAG_DOC │ 0 │ 0 │ 0.004972 │ 0.07034 │ 0 │ 0 │ 0 │ 0 │ 1 │ ▇ │ │ │ │ UMENT_13 │ │ │ │ │ │ │ │ │ │ │ │ │ │ FLAG_DOC │ 0 │ 0 │ 0.004738 │ 0.06867 │ 0 │ 0 │ 0 │ 0 │ 1 │ ▇ │ │ │ │ UMENT_14 │ │ │ │ │ │ │ │ │ │ │ │ │ │ FLAG_DOC │ 0 │ 0 │ 0.00195 │ 0.04411 │ 0 │ 0 │ 0 │ 0 │ 1 │ ▇ │ │ │ │ UMENT_15 │ │ │ │ │ │ │ │ │ │ │ │ │ │ FLAG_DOC │ 0 │ 0 │ 0.01503 │ 0.1217 │ 0 │ 0 │ 0 │ 0 │ 1 │ ▇ │ │ │ │ UMENT_16 │ │ │ │ │ │ │ │ │ │ │ │ │ │ FLAG_DOC │ 0 │ 0 │ 0.00039 │ 0.01974 │ 0 │ 0 │ 0 │ 0 │ 1 │ ▇ │ │ │ │ UMENT_17 │ │ │ │ │ │ │ │ │ │ │ │ │ │ FLAG_DOC │ 0 │ 0 │ 0.01322 │ 0.1142 │ 0 │ 0 │ 0 │ 0 │ 1 │ ▇ │ │ │ │ UMENT_18 │ │ │ │ │ │ │ │ │ │ │ │ │ │ FLAG_DOC │ 0 │ 0 │ 0.000799 │ 0.02826 │ 0 │ 0 │ 0 │ 0 │ 1 │ ▇ │ │ │ │ UMENT_19 │ │ │ 4 │ │ │ │ │ │ │ │ │ │ │ FLAG_DOC │ 0 │ 0 │ 0.000273 │ 0.01652 │ 0 │ 0 │ 0 │ 0 │ 1 │ ▇ │ │ │ │ UMENT_20 │ │ │ │ │ │ │ │ │ │ │ │ │ │ FLAG_DOC │ 0 │ 0 │ 0.000682 │ 0.02611 │ 0 │ 0 │ 0 │ 0 │ 1 │ ▇ │ │ │ │ UMENT_21 │ │ │ 4 │ │ │ │ │ │ │ │ │ │ │ AMT_REQ_ │ 18591 │ 36.25 │ 0.00682 │ 0.087 │ 0 │ 0 │ 0 │ 0 │ 3 │ ▇ │ │ │ │ CREDIT_B │ │ │ │ │ │ │ │ │ │ │ │ │ │ UREAU_HO │ │ │ │ │ │ │ │ 
│ │ │ │ │ │ UR │ │ │ │ │ │ │ │ │ │ │ │ │ │ AMT_REQ_ │ 18591 │ 36.25 │ 0.008227 │ 0.1321 │ 0 │ 0 │ 0 │ 0 │ 8 │ ▇ │ │ │ │ CREDIT_B │ │ │ │ │ │ │ │ │ │ │ │ │ │ UREAU_DA │ │ │ │ │ │ │ │ │ │ │ │ │ │ Y │ │ │ │ │ │ │ │ │ │ │ │ │ │ AMT_REQ_ │ 18591 │ 36.25 │ 0.0259 │ 0.1865 │ 0 │ 0 │ 0 │ 0 │ 6 │ ▇ │ │ │ │ CREDIT_B │ │ │ │ │ │ │ │ │ │ │ │ │ │ UREAU_WE │ │ │ │ │ │ │ │ │ │ │ │ │ │ EK │ │ │ │ │ │ │ │ │ │ │ │ │ │ AMT_REQ_ │ 18591 │ 36.25 │ 0.1259 │ 0.5168 │ 0 │ 0 │ 0 │ 0 │ 15 │ ▇ │ │ │ │ CREDIT_B │ │ │ │ │ │ │ │ │ │ │ │ │ │ UREAU_MO │ │ │ │ │ │ │ │ │ │ │ │ │ │ N │ │ │ │ │ │ │ │ │ │ │ │ │ │ AMT_REQ_ │ 18591 │ 36.25 │ 0.1567 │ 0.4538 │ 0 │ 0 │ 0 │ 0 │ 19 │ ▇ │ │ │ │ CREDIT_B │ │ │ │ │ │ │ │ │ │ │ │ │ │ UREAU_QR │ │ │ │ │ │ │ │ │ │ │ │ │ │ T │ │ │ │ │ │ │ │ │ │ │ │ │ │ AMT_REQ_ │ 18591 │ 36.25 │ 0.7666 │ 0.9851 │ 0 │ 0 │ 0 │ 1 │ 9 │ ▇▁▁ │ │ │ │ CREDIT_B │ │ │ │ │ │ │ │ │ │ │ │ │ │ UREAU_YE │ │ │ │ │ │ │ │ │ │ │ │ │ │ AR │ │ │ │ │ │ │ │ │ │ │ │ │ └──────────┴───────┴───────┴──────────┴──────────┴──────────┴─────────┴─────────┴─────────┴─────────┴────────┘ │ │ string │ │ ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┓ │ │ ┃ column_name ┃ NA ┃ NA % ┃ words per row ┃ total words ┃ │ │ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━┩ │ │ │ NAME_CONTRACT_TYPE │ 0 │ 0 │ 2 │ 102576 │ │ │ │ CODE_GENDER │ 0 │ 0 │ 1 │ 51288 │ │ │ │ FLAG_OWN_CAR │ 0 │ 0 │ 1 │ 51288 │ │ │ │ FLAG_OWN_REALTY │ 0 │ 0 │ 1 │ 51288 │ │ │ │ NAME_TYPE_SUITE │ 320 │ 0.62 │ 1 │ 53444 │ │ │ │ NAME_INCOME_TYPE │ 0 │ 0 │ 1.3 │ 66926 │ │ │ │ NAME_EDUCATION_TYPE │ 0 │ 0 │ 3.4 │ 174134 │ │ │ │ NAME_FAMILY_STATUS │ 0 │ 0 │ 1.7 │ 87595 │ │ │ │ NAME_HOUSING_TYPE │ 0 │ 0 │ 2.9 │ 147099 │ │ │ │ OCCUPATION_TYPE │ 16214 │ 31.61 │ 1.1 │ 57171 │ │ │ │ WEEKDAY_APPR_PROCESS_START │ 0 │ 0 │ 1 │ 51288 │ │ │ │ ORGANIZATION_TYPE │ 0 │ 0 │ 2.1 │ 107715 │ │ │ │ FONDKAPREMONT_MODE │ 35708 │ 69.62 │ 0.93 │ 47729 │ │ │ │ 
HOUSETYPE_MODE │ 26685 │ 52.03 │ 1.4 │ 73398 │ │ │ │ WALLSMATERIAL_MODE │ 27053 │ 52.75 │ 0.68 │ 34667 │ │ │ │ EMERGENCYSTATE_MODE │ 25298 │ 49.33 │ 0.51 │ 25990 │ │ │ └────────────────────────────────────┴────────────┴────────────┴────────────────────────┴────────────────────┘ │ ╰────────────────────────────────────────────────────── End ──────────────────────────────────────────────────────╯
This leaves us with 51,288 applications in the “insufficient history” category, meaning the columns in the application_train.csv file are our best source for predicting default. Let’s get some quick observations on this subset of applications:
- Just as before, most items relating to a dwelling have a high rate of missing data
- I’m curious to see if these data points are even helpful or not
- There are approximately 43 dwelling-related predictors; if we reduce these to their principal components, we can test whether there’s anything of predictive value in them
- Mean income (and its variance) is lower in this sample than in the full application_train.csv file, while credit amount and other measures are fairly consistent
- I’m getting the impression that these applications are, in fact, very similar to the overall population
- As the prompt noted originally, due to insufficient credit history, these applicants often fall victim to overly punitive terms from lenders
Underserved Applications
The crux of this problem is to improve the borrowing experience for this underserved group of applications. Therefore, if we find a pretty good model for this group, we’re likely to be creating a good model overall. Let’s dive into some predictive exploration for this group.
Documents
There are roughly 20 FLAG_DOCUMENT_# features in the dataset. Let’s see whether they’re valuable at all. For that, we’ll reduce them to their principal components and then fit a simple model.
We’ll load the libraries necessary for PCA:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
Now let’s get our data in the right format. We want 1) all of the FLAG_DOCUMENT_# columns in one dataframe and 2) TARGET in its own.
app_docs = insufficient_history.filter(like='FLAG_DOCUMENT_') # Extract all FLAG_DOCUMENT variables
app_docs_scaled = StandardScaler().fit_transform(app_docs) # Standardize each column
app_target = insufficient_history[['TARGET']] # Get the target variable
Now we can reduce these 20 variables to their principal components. Hopefully, a small fraction of them will explain 80-90% of the variance, which would allow us to use fewer variables in a test model. Either way, we’ll then be able to run a simple model to see how predictive these are.
import numpy as np
np.random.seed(814) # Seed NumPy's RNG (scikit-learn uses NumPy, not Python's built-in random module)
pca = PCA(n_components=app_docs_scaled.shape[1]) # Define the PCA
app_docs_pca = pca.fit_transform(app_docs_scaled) # Run the PCA
exp_var = pca.explained_variance_ratio_ # Get the explained variance for each principal component
print(exp_var[0:4]) # How much variance is explained by the first four components
[0.08759671 0.06570449 0.05878099 0.0574298 ]
These first four components explain barely a quarter of the variance, so PCA isn’t helping us reduce the number of variables. Instead, let’s just run a simple logistic regression with all these predictors modeling TARGET. I really like the statsmodels package, as it produces model summaries very similar to R’s, which is huge for interpretability.
import statsmodels.api as sm
app_docs_mod = sm.Logit(app_target, app_docs_pca).fit() # Fit a logistic regression
print(app_docs_mod.summary()) # Print the model summary
Optimization terminated successfully.
Current function value: 0.692194
Iterations 4
Logit Regression Results
==============================================================================
Dep. Variable: TARGET No. Observations: 51288
Model: Logit Df Residuals: 51270
Method: MLE Df Model: 17
Date: Fri, 11 Oct 2024 Pseudo R-squ.: -1.291
Time: 06:27:32 Log-Likelihood: -35501.
converged: True LL-Null: -15494.
Covariance Type: nonrobust LLR p-value: 1.000
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
x1 0.0515 0.044 1.174 0.240 -0.035 0.138
x2 -0.0101 97.477 -0.000 1.000 -191.062 191.041
x3 -0.0101 1.16e+04 -8.71e-07 1.000 -2.26e+04 2.26e+04
x4 -0.0128 1.02e+05 -1.26e-07 1.000 -1.99e+05 1.99e+05
x5 0.0159 2.5e+04 6.37e-07 1.000 -4.89e+04 4.89e+04
x6 -0.0073 8.27e+04 -8.86e-08 1.000 -1.62e+05 1.62e+05
x7 0.0018 1.5e+05 1.17e-08 1.000 -2.95e+05 2.95e+05
x8 -0.0107 9.87e+04 -1.08e-07 1.000 -1.94e+05 1.94e+05
x9 -0.0062 1.02e+05 -6.08e-08 1.000 -2e+05 2e+05
x10 -0.0043 1.26e+05 -3.42e-08 1.000 -2.46e+05 2.46e+05
x11 0.0071 4.7e+04 1.51e-07 1.000 -9.22e+04 9.22e+04
x12 -0.0067 9.61e+04 -6.98e-08 1.000 -1.88e+05 1.88e+05
x13 -0.0030 7275.912 -4.07e-07 1.000 -1.43e+04 1.43e+04
x14 0.0108 8.35e+04 1.3e-07 1.000 -1.64e+05 1.64e+05
x15 0.0069 1.91e+04 3.59e-07 1.000 -3.75e+04 3.75e+04
x16 -0.0154 6661.497 -2.31e-06 1.000 -1.31e+04 1.31e+04
x17 -0.0246 82.629 -0.000 1.000 -161.974 161.925
x18 0.0732 0.017 4.417 0.000 0.041 0.106
x19 1.401e-18 1.3e+21 1.08e-39 1.000 -2.55e+21 2.55e+21
x20 2.05e-18 7.07e+20 2.9e-39 1.000 -1.39e+21 1.39e+21
==============================================================================
Clearly, this is a very bad model. The pseudo \(R^2\) is negative (partly an artifact of fitting without an intercept, which sm.Logit does not add automatically), and only one component, x18, is statistically significant, with x1 the next closest (p = 0.24). We can sum the PCA explained variance from above for those two components:
exp_var[0] + exp_var[17] # Explained variance for x1 and x18, the two strongest components
0.10376992731867007
Only 10% of the variance is explained even by these two components. As is, these won’t be helpful in predicting our target variable. A simple count of documents submitted might still be somewhat helpful; we’ll look at that later.
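As a sketch of that count-of-documents idea (a toy dataframe stands in for insufficient_history here, and DOC_COUNT is a hypothetical feature name):

```python
import pandas as pd

# Toy stand-in for `insufficient_history`: three applications, three document flags
toy = pd.DataFrame({
    "FLAG_DOCUMENT_2": [0, 1, 0],
    "FLAG_DOCUMENT_3": [1, 1, 0],
    "FLAG_DOCUMENT_4": [0, 1, 0],
    "TARGET": [0, 1, 0],
})

# Count of documents submitted per application
toy["DOC_COUNT"] = toy.filter(like="FLAG_DOCUMENT_").sum(axis=1)
print(toy["DOC_COUNT"].tolist())  # [1, 3, 0]
```

On the real data this would be the same one-liner against all 20 flag columns.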
Dwelling
Let’s run a similar exercise on the variables related to dwelling. We have several versions with different measures of central tendency (average, median, and mode); let’s go with the median versions. We know there are a lot of missing values, so let’s drop those rows while we explore the predictive power of these variables.
app_dwell = insufficient_history.filter(like='_MEDI') # Extract all the "median" variables related to dwelling
missing_idx = app_dwell.isnull().any(axis=1) # Flag rows with any missing value
app_dwell = app_dwell.loc[~missing_idx,:] # Remove missing rows from the dwelling variables
app_target_dwell = app_target.loc[~missing_idx,:] # Remove missing rows from the target
app_dwell.shape
(12969, 14)
We’re left with 12,969 rows across 14 columns and no missing values. We don’t need to normalize these values since they came to us already scaled. If this complete version of the dwelling variables isn’t decently predictive of TARGET, there’s no good argument for including them in some way in a larger model.
Let’s see if these can be consolidated with PCA.
pca = PCA(n_components=app_dwell.shape[1]) # Define the PCA
app_docs_pca = pca.fit_transform(app_dwell) # Run the PCA
dwell_exp_var = pca.explained_variance_ratio_ # Get the explained variance for each principal component
print(dwell_exp_var[0:5]) # How much variance is explained by the first 5 components [0.57225323 0.16350268 0.05873402 0.04947089 0.02826078]
Okay! The PCA is helping quite a bit here. We’ve got 87% of the variance explained in the first 5 components. That’s great info!
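To pin down exactly how many components clear a given variance threshold, we can take a cumulative sum of the ratios. The snippet below hardcodes the five ratios printed above so it stands alone:

```python
import numpy as np

# First five explained-variance ratios from the dwelling PCA above
dwell_exp_var = np.array([0.57225323, 0.16350268, 0.05873402, 0.04947089, 0.02826078])

cum_var = np.cumsum(dwell_exp_var)  # running total of variance explained
# Smallest number of components whose cumulative variance reaches 80%
n_components_80 = int(np.argmax(cum_var >= 0.80) + 1)
print(cum_var.round(3))  # [0.572 0.736 0.794 0.844 0.872]
print(n_components_80)   # 4
```

So four components would already cover about 84% of the variance here.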
This would be a great time for a visualization and showing off the power of plotnine, a grammar-of-graphics visualization package for Python modeled closely on R’s ggplot2. Let’s visualize the proportion of variance captured by each principal component.
First, we need to make a dataframe with the principal components we have in a list above.
dwell_pca_df = (
pd.DataFrame({
"component": list(map(lambda x: f'x{x}', range(1, 15))),
"variance": dwell_exp_var
})
.assign(variance_fmt = lambda df: df["variance"].map(lambda x: f'{x * 100:.1f}%'))
.sort_values("variance", ascending = False)
.reset_index(drop = True)
)
component_list = dwell_pca_df["component"]
component_cat = pd.Categorical(dwell_pca_df["component"], categories = component_list)
dwell_pca_df = dwell_pca_df.assign(component = component_cat)
dwell_pca_df
| | component | variance | variance_fmt |
|---|---|---|---|
| 0 | x1 | 0.572253 | 57.2% |
| 1 | x2 | 0.163503 | 16.4% |
| 2 | x3 | 0.058734 | 5.9% |
| 3 | x4 | 0.049471 | 4.9% |
| 4 | x5 | 0.028261 | 2.8% |
| 5 | x6 | 0.026983 | 2.7% |
| 6 | x7 | 0.025260 | 2.5% |
| 7 | x8 | 0.020493 | 2.0% |
| 8 | x9 | 0.016817 | 1.7% |
| 9 | x10 | 0.014299 | 1.4% |
| 10 | x11 | 0.011477 | 1.1% |
| 11 | x12 | 0.006752 | 0.7% |
| 12 | x13 | 0.003870 | 0.4% |
| 13 | x14 | 0.001827 | 0.2% |
from plotnine import ggplot, geom_col, geom_text, aes, labs, theme_minimal, theme, element_blank, element_text
(
ggplot(dwell_pca_df, aes("component", "variance", label = "variance_fmt"))
+ geom_col(fill = "#BE0000")
+ geom_text(va = "bottom")
+ labs(
title = "Proportion of variance by principal component",
subtitle = "Dwelling related variables",
x = "Component"
)
+ theme_minimal()
+ theme(
panel_grid_major_x = element_blank(),
panel_grid_major_y = element_blank(),
axis_title_y = element_blank(),
axis_title_x = element_text(margin = {"t": 20,"r": 0,"b": 0,"l": 0,"units": "pt"}),
axis_text_y = element_blank()
)
)
Instead of using the principal components as the predictors, let’s use the original variables in a logistic regression model (I’m curious which, if any, are predictive of TARGET).
app_dwell_mod = sm.Logit(app_target_dwell, app_dwell).fit() # Fit a logistic regression
print(app_dwell_mod.summary()) # Print the model summary
Optimization terminated successfully.
Current function value: 0.257421
Iterations 8
Logit Regression Results
==============================================================================
Dep. Variable: TARGET No. Observations: 12969
Model: Logit Df Residuals: 12955
Method: MLE Df Model: 13
Date: Fri, 11 Oct 2024 Pseudo R-squ.: 0.01294
Time: 06:27:34 Log-Likelihood: -3338.5
converged: True LL-Null: -3382.2
Covariance Type: nonrobust LLR p-value: 4.195e-13
================================================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------------------------
APARTMENTS_MEDI -0.4194 1.160 -0.362 0.718 -2.693 1.854
BASEMENTAREA_MEDI 0.2009 0.679 0.296 0.767 -1.130 1.532
YEARS_BEGINEXPLUATATION_MEDI -2.1403 0.267 -8.022 0.000 -2.663 -1.617
YEARS_BUILD_MEDI 0.2295 0.374 0.614 0.539 -0.503 0.962
COMMONAREA_MEDI -0.0328 0.619 -0.053 0.958 -1.247 1.181
ELEVATORS_MEDI -0.6470 0.624 -1.038 0.299 -1.869 0.575
ENTRANCES_MEDI -1.3946 0.594 -2.348 0.019 -2.559 -0.231
FLOORSMAX_MEDI -2.0031 0.523 -3.828 0.000 -3.029 -0.977
FLOORSMIN_MEDI -0.3427 0.346 -0.990 0.322 -1.021 0.336
LANDAREA_MEDI 0.7573 0.493 1.537 0.124 -0.208 1.723
LIVINGAPARTMENTS_MEDI 0.1574 1.033 0.152 0.879 -1.867 2.182
LIVINGAREA_MEDI 1.5549 0.940 1.654 0.098 -0.288 3.398
NONLIVINGAPARTMENTS_MEDI -1.6355 1.460 -1.120 0.263 -4.496 1.225
NONLIVINGAREA_MEDI -0.7401 0.750 -0.987 0.324 -2.211 0.730
================================================================================================
At first blush, only YEARS_BEGINEXPLUATATION_MEDI, FLOORSMAX_MEDI, and (marginally) ENTRANCES_MEDI look useful. The \(R^2\) value is poor and nearly every other variable has too high a p-value. If you remember, 40-71% of the values in these dwelling variables were missing anyway. It’s probably best to exclude them entirely from future models.
It is possible, however, that a simple field for flagging what % of dwelling data is available would be insightful. We’ll look into that later.
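A sketch of that availability flag (a toy dataframe stands in for the real one; DWELL_PCT_AVAILABLE is a hypothetical column name):

```python
import numpy as np
import pandas as pd

# Toy stand-in: two dwelling columns with varying missingness
toy = pd.DataFrame({
    "APARTMENTS_MEDI": [0.08, np.nan, np.nan],
    "FLOORSMAX_MEDI":  [0.17, 0.33, np.nan],
})

# Share of dwelling fields populated for each application
toy["DWELL_PCT_AVAILABLE"] = toy.filter(like="_MEDI").notna().mean(axis=1)
print(toy["DWELL_PCT_AVAILABLE"].tolist())  # [1.0, 0.5, 0.0]
```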
Categorical variables
Let’s turn our attention to some of the discrete fields. For these, we’ll want to evaluate how the proportion of TARGET varies across categories. For example, if the same proportion of men default as women, gender may not, on its own, have much predictive power. A chi-square test helps quantify this.
Let’s design a function that takes the data and a variable name and returns the results of the test:
import numpy as np
def run_chi_sq(data, var):
    # Get and print the cross tab
    data_crosstab = pd.crosstab(data[var], data["TARGET"])
    data_crosstab_prop = pd.crosstab(data[var], data["TARGET"], normalize='index')
    print(data_crosstab_prop)
    # Run the chi-square test
    observed_vals = data_crosstab.to_numpy()
    observed_table = sm.stats.Table(observed_vals)
    chi_sq_stat = observed_table.test_nominal_association()
    # Print chi-square test results
    print("\n-----------------------------------")
    print(f"Chi-square statistic: {chi_sq_stat.statistic}")
    print(f"P-Value: {chi_sq_stat.pvalue}")
This is also just a good way to get a sense of the unique values each of these fields contains, since skimpy hasn’t given us insight into that.
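The helper leans on statsmodels’ Table.test_nominal_association. As a sanity check, scipy’s chi2_contingency (scipy is a scikit-learn dependency, so I’m assuming it’s available) computes the same Pearson statistic; here on a toy 2x2 table:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Toy 2x2 contingency table: rows = a binary category, columns = TARGET (0/1)
observed = np.array([[900, 100],
                     [450, 150]])

# correction=False gives the plain Pearson statistic (no Yates correction)
stat, pvalue, dof, expected = chi2_contingency(observed, correction=False)
print(round(stat, 2), dof)  # 64.0 1
```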
NAME_CONTRACT_TYPE
run_chi_sq(app_train_df, "NAME_CONTRACT_TYPE")
TARGET 0 1
NAME_CONTRACT_TYPE
Cash loans 0.916541 0.083459
Revolving loans 0.945217 0.054783
-----------------------------------
Chi-square statistic: 293.53682101554654
P-Value: 0.0
CODE_GENDER
run_chi_sq(app_train_df, "CODE_GENDER")
TARGET 0 1
CODE_GENDER
F 0.930007 0.069993
M 0.898581 0.101419
XNA 1.000000 0.000000
-----------------------------------
Chi-square statistic: 920.4804789807738
P-Value: 0.0
FLAG_OWN_CAR
run_chi_sq(app_train_df, "FLAG_OWN_CAR")
TARGET 0 1
FLAG_OWN_CAR
N 0.914998 0.085002
Y 0.927563 0.072437
-----------------------------------
Chi-square statistic: 146.82528196185706
P-Value: 0.0
FLAG_OWN_REALTY
run_chi_sq(app_train_df, "FLAG_OWN_REALTY")
TARGET 0 1
FLAG_OWN_REALTY
N 0.916751 0.083249
Y 0.920384 0.079616
-----------------------------------
Chi-square statistic: 11.624737274340792
P-Value: 0.0006508043252171358
NAME_INCOME_TYPE
run_chi_sq(app_train_df, "NAME_INCOME_TYPE")
TARGET 0 1
NAME_INCOME_TYPE
Businessman 1.000000 0.000000
Commercial associate 0.925157 0.074843
Maternity leave 0.600000 0.400000
Pensioner 0.946136 0.053864
State servant 0.942450 0.057550
Student 1.000000 0.000000
Unemployed 0.636364 0.363636
Working 0.904115 0.095885
-----------------------------------
Chi-square statistic: 1251.8435948609447
P-Value: 0.0
NAME_EDUCATION_TYPE
run_chi_sq(app_train_df, "NAME_EDUCATION_TYPE")
TARGET 0 1
NAME_EDUCATION_TYPE
Academic degree 0.981707 0.018293
Higher education 0.946449 0.053551
Incomplete higher 0.915150 0.084850
Lower secondary 0.890723 0.109277
Secondary / secondary special 0.910601 0.089399
-----------------------------------
Chi-square statistic: 1019.2131873088356
P-Value: 0.0
NAME_FAMILY_STATUS
run_chi_sq(app_train_df, "NAME_FAMILY_STATUS")
TARGET 0 1
NAME_FAMILY_STATUS
Civil marriage 0.900554 0.099446
Married 0.924401 0.075599
Separated 0.918058 0.081942
Single / not married 0.901923 0.098077
Unknown 1.000000 0.000000
Widow 0.941758 0.058242
-----------------------------------
Chi-square statistic: 504.98913222135616
P-Value: 0.0
NAME_HOUSING_TYPE
run_chi_sq(app_train_df, "NAME_HOUSING_TYPE")
TARGET 0 1
NAME_HOUSING_TYPE
Co-op apartment 0.920677 0.079323
House / apartment 0.922043 0.077957
Municipal apartment 0.914603 0.085397
Office apartment 0.934276 0.065724
Rented apartment 0.876869 0.123131
With parents 0.883019 0.116981
-----------------------------------
Chi-square statistic: 420.55618983894647
P-Value: 0.0
ORGANIZATION_TYPE
run_chi_sq(app_train_df, "ORGANIZATION_TYPE")
TARGET 0 1
ORGANIZATION_TYPE
Advertising 0.918415 0.081585
Agriculture 0.895273 0.104727
Bank 0.948145 0.051855
Business Entity Type 1 0.918616 0.081384
Business Entity Type 2 0.914716 0.085284
Business Entity Type 3 0.907004 0.092996
Cleaning 0.888462 0.111538
Construction 0.883202 0.116798
Culture 0.944591 0.055409
Electricity 0.933684 0.066316
Emergency 0.928571 0.071429
Government 0.930219 0.069781
Hotel 0.935818 0.064182
Housing 0.920554 0.079446
Industry: type 1 0.889317 0.110683
Industry: type 10 0.935780 0.064220
Industry: type 11 0.913462 0.086538
Industry: type 12 0.962060 0.037940
Industry: type 13 0.865672 0.134328
Industry: type 2 0.927948 0.072052
Industry: type 3 0.893838 0.106162
Industry: type 4 0.898518 0.101482
Industry: type 5 0.931553 0.068447
Industry: type 6 0.928571 0.071429
Industry: type 7 0.919663 0.080337
Industry: type 8 0.875000 0.125000
Industry: type 9 0.933195 0.066805
Insurance 0.943049 0.056951
Kindergarten 0.929651 0.070349
Legal Services 0.921311 0.078689
Medicine 0.934155 0.065845
Military 0.948747 0.051253
Mobile 0.908517 0.091483
Other 0.923575 0.076425
Police 0.950021 0.049979
Postal 0.915624 0.084376
Realtor 0.893939 0.106061
Religion 0.941176 0.058824
Restaurant 0.882938 0.117062
School 0.940852 0.059148
Security 0.900216 0.099784
Security Ministries 0.951368 0.048632
Self-employed 0.898261 0.101739
Services 0.933968 0.066032
Telecom 0.923744 0.076256
Trade: type 1 0.910920 0.089080
Trade: type 2 0.930000 0.070000
Trade: type 3 0.896621 0.103379
Trade: type 4 0.968750 0.031250
Trade: type 5 0.938776 0.061224
Trade: type 6 0.954041 0.045959
Trade: type 7 0.905504 0.094496
Transport: type 1 0.955224 0.044776
Transport: type 2 0.921960 0.078040
Transport: type 3 0.842460 0.157540
Transport: type 4 0.907188 0.092812
University 0.951017 0.048983
XNA 0.946004 0.053996
-----------------------------------
Chi-square statistic: 1609.2406359645197
P-Value: 0.0
Predictably, given the sample size, the differences in default rates across categories are statistically significant for every one of these fields.
The most interesting, and likely most predictive, is NAME_INCOME_TYPE. Naturally, one’s propensity to default on a loan is often tied to their cash flow. Still, categories like Student never defaulting are striking.
ORGANIZATION_TYPE is interesting as well. There are a lot of values here, which isn’t ideal for a model. Looking at them, some are clearly similar to one another and others aren’t. This could be a good opportunity for a clustering algorithm.
Clustering organizations
Let's first prepare the data by generating two features for each category:

- Proportion of observations mapped to default (`TARGET` value of `1`)
- Total number of observations

That should be a decent input to a simple algorithm like Nearest Neighbors.
org_summary = duckdb.sql("""
SELECT
ORGANIZATION_TYPE
,SUM(TARGET) / COUNT(*) AS PROP_DEFAULT
,COUNT(*) AS TOTAL_OBS
FROM app_train_df
GROUP BY
ORGANIZATION_TYPE
""").fetchdf()
org_summary.head()

|   | ORGANIZATION_TYPE | PROP_DEFAULT | TOTAL_OBS |
|---|---|---|---|
| 0 | Other | 0.076425 | 16683 |
| 1 | Trade: type 7 | 0.094496 | 7831 |
| 2 | Medicine | 0.065845 | 11193 |
| 3 | Construction | 0.116798 | 6721 |
| 4 | Services | 0.066032 | 1575 |
Let's now build an agglomerative clustering model, constrained by a k-nearest-neighbors connectivity graph, and apply it to these measures.
from sklearn.cluster import AgglomerativeClustering
from sklearn.neighbors import kneighbors_graph
org_X = org_summary.loc[:,["PROP_DEFAULT", "TOTAL_OBS"]].to_numpy()
knn_graph = kneighbors_graph(org_X, 5, include_self=False)
agg_clustering = AgglomerativeClustering(n_clusters=5, connectivity=knn_graph)
org_Labels = agg_clustering.fit_predict(org_X)
org_Labels

array([0, 0, 0, 0, 2, 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 0, 0, 2, 2, 2, 2, 2,
       2, 2, 2, 4, 1, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 0, 0, 0, 2, 2, 2, 2], dtype=int64)
With the clustering complete, we can now take these labels, add them back into the organization summary and get a sense for which types of organizations are similar.
org_summary["ORG_CLASS"] = org_Labels
duckdb.sql("""
SELECT
ORG_CLASS
,STRING_AGG(ORGANIZATION_TYPE, ',') AS ORG_TYPES
FROM org_summary
GROUP BY
ORG_CLASS
""").show()

┌───────────┬────────────────────────────────────────────────────────────────┐
│ ORG_CLASS │                           ORG_TYPES                            │
│   int64   │                            varchar                             │
├───────────┼────────────────────────────────────────────────────────────────┤
│ 2         │ Services,Industry: type 12,Industry: type 2,Transport: type 1,Agriculture,Industry: type 4,Trade: type 6,University,Industry: type 13,Trade: type 4,Bank,Security,Culture,Police,Trade: type 1,Housing,Industry: type 5,Religion,Military,Industry: type 11,Telecom,Industry: type 9,Electricity,Hotel,Advertising,Industry: type 6,Postal,Transport: type 2,Industry: type 3,Restaurant,Industry: type 1,Mobile,Cleaning,Legal Services,Industry: type 8,Trade: type 5,Trade: type 2,Trade: type 3,Security Ministries,Industry: type 7,Insurance,Transport: type 3,Emergency,Realtor,Industry: type 10 │
│ 0         │ Other,Trade: type 7,Medicine,Construction,Business Entity Type 2,Kindergarten,School,Government,Business Entity Type 1,Transport: type 4 │
│ 1         │ Business Entity Type 3 │
│ 3         │ Self-employed │
│ 4         │ XNA │
└───────────┴────────────────────────────────────────────────────────────────┘
Looking at these results, this wasn't an ideal clustering of the types, since most of them are still concentrated in two classes. Mathematically, though, those types are fairly similar in their relationship with TARGET, and the reduced set of labels should be a far better predictor than the original ORGANIZATION_TYPE.
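Part of why most types collapse into two clusters may be scale: PROP_DEFAULT lives in roughly 0.03–0.16 while TOTAL_OBS ranges into the tens of thousands, so Euclidean distances are dominated almost entirely by the count column. A minimal sketch of standardizing the features first, using synthetic stand-in data since `org_summary` isn't reconstructed here:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import AgglomerativeClustering

# Synthetic stand-in for org_summary[["PROP_DEFAULT", "TOTAL_OBS"]]
# (58 organization types, like the real data)
rng = np.random.default_rng(42)
org_X = np.column_stack([
    rng.uniform(0.03, 0.16, size=58),          # default proportions
    rng.integers(100, 70000, size=58),         # observation counts
]).astype(float)

# Standardize so both features contribute comparably to distances
org_X_scaled = StandardScaler().fit_transform(org_X)

labels = AgglomerativeClustering(n_clusters=5).fit_predict(org_X_scaled)
print(np.bincount(labels))  # cluster sizes
```

On the scaled matrix, the default proportion can actually influence the groupings instead of being drowned out by raw counts.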
Conclusion
Summary
There’s plenty more to explore with these data sets. As we move on from the exploratory data analysis phase and into modeling, there will continue to be emphasis placed on discovering insights.
At this point, however, we’ve reached some key milestones:
- We have a very thorough understanding of the data
  - What fields are available
  - How each data set connects to the others
- We've explored who this cohort of underserved applicants is
  - What "insufficient" credit history looks like
  - How many applications meet this definition
  - How similar this sample of applications is to the population
- We also began exploring predictive power
  - Which fields of the application are likely distractions
  - How to reduce the dimensionality of continuous variables
- We also homed in on discrete values and the information they hold
  - The unique range of values present in categorical variables
  - Approaches to clustering discrete variables for dimensionality reduction
Key Insights
Throughout this exploration of the data, a few key insights have surfaced:
- Data from past applications and credit bureau history applies to 6 of every 7 applications
  - The remainder are those with "insufficient" history, defined as fewer than 3 previous records
- Applications with insufficient history are, as a whole, very similar to the population at large
- `DOCUMENT` and `DWELLING` related variables present little to no predictive power for default
- A handful of categorical variables seem like promising candidates for high predictive power
  - Namely: `NAME_INCOME_TYPE`, `NAME_EDUCATION_TYPE`, and `ORGANIZATION_TYPE`